2025 iThome Ironman Contest

DAY 3

Software Development

Pandas from Scratch, Plus a Little Matplotlib: part 3 of the series

Day2: A First Look at Data Structures

17th Ironman Contest

Hilda

2025-09-16 23:58:28

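Pandas has two core data structures:
Series: one-dimensional data; you can think of it as an array
DataFrame: two-dimensional data; a table with rows and columns
Every column is a Series, and multiple Series together make up a DataFrame.
A DataFrame is an old friend: it looks just like an Excel sheet or a database table.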

Series

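We just said that every column is a Series, so what is a Series?
A Series is a one-dimensional array that can hold any data type (integers, strings, floating-point numbers, Python objects). An ordinary array is indexed only by integer positions starting at 0, whereas a Series can be given index labels. An index label works like a dictionary key and makes the data easy to look up.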
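To make the "index label works like a dictionary key" idea concrete, here is a minimal sketch. The fruit names and prices are made-up illustration data, not something from the ubike dataset used later.

```python
import pandas as pd

# Hypothetical data: three prices, each tagged with an index label.
prices = pd.Series([30, 45, 12], index=["apple", "banana", "cherry"])

# Look up a value by its index label, much like a dictionary key.
banana_price = prices["banana"]

# The labels themselves live in the .index attribute.
labels = list(prices.index)

print(banana_price)
print(labels)
```

Unlike a plain dictionary, the Series still keeps its values in order, so positional access remains available as well.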

Creating a Series

create by array

The simplest way is to pass in a list (an array).

import pandas as pd
pd.Series([1, 2, 3, 4, 5])

create by dictionary

We can also build a Series from a dictionary; Pandas stores the keys as the index labels.

d = {"a": 0.0, "b": 1.0, "c": 2.0}
pd.Series(d)

create by numpy

Or we can lean on the power of numpy. As usual, we need to import numpy first.

import numpy as np
s_withoutindex = pd.Series(np.random.rand(5))
s_withindex  = pd.Series(np.random.rand(5), index=["a", "b", "c", "d", "e"])
print(s_withoutindex)
print(s_withindex)

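np.random.rand(5) asks numpy to generate 5 random numbers between 0 and 1.

To assign index labels, pass them through the index parameter of pd.Series; "index" here means the index labels. After printing s_withoutindex and s_withindex, you can see that the index changes from numbers to the labels we assigned.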

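index (s_withoutindex)	value	index label (s_withindex)	value
0	0.790777	a	0.070450
1	0.635870	b	0.485292
2	-0.218981	c	1.075480
3	1.102360	d	-0.116105
4	-0.102805	e	-0.833574

The values are random and will differ on every run. Note that np.random.rand only returns values in the range [0, 1), so the negative and greater-than-1 values in this sample look like output from np.random.randn (standard normal) rather than from the code above.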

If the length of index does not match the length of the Series data, you get: ValueError: Length of values (5) does not match length of index (4)
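The mismatch is easy to reproduce. The sketch below passes five values with only four labels and catches the resulting exception; the variable names are just for illustration, and the exact wording of the message may vary between pandas versions.

```python
import numpy as np
import pandas as pd

values = np.random.rand(5)      # five values
labels = ["a", "b", "c", "d"]   # only four labels

try:
    pd.Series(values, index=labels)
    error_message = None
except ValueError as exc:
    # pandas refuses to build the Series when the lengths differ.
    error_message = str(exc)

print(error_message)
```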

Reading a Series

You can read the whole Series, or read a single element by position or by index label.

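s_withindex          # read the whole Series
s_withindex.iloc[0]  # read by position
s_withindex['a']     # read by index label

A Series with index labels can be read either by position (s_withindex.iloc[0]) or by label (s_withindex['a']). A Series without explicit labels gets the default integer index 0, 1, 2, ..., so those integers are its labels. Recent pandas versions deprecate treating a bare integer inside [] as a position when the labels are not integers, which is why the example uses .iloc for positional access.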
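A small sketch of the two access styles side by side, using a made-up three-element Series. The .iloc and .loc accessors make it explicit whether a position or a label is meant.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

first_by_position = s.iloc[0]   # element at position 0
first_by_label = s.loc["a"]     # element whose label is "a"
subset = s.loc[["a", "c"]]      # several labels at once returns a Series

print(first_by_position, first_by_label)
print(subset)
```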

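Operating on a Series

pandas supports vectorized operations, so you can compute over an entire Series without writing a loop, and the performance is great!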

s_withindex = s_withindex + 2

Other operations such as subtraction, multiplication, division, and exponentiation work just as well.

s_withindex = s_withindex * 2
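As a rough illustration of what "without writing a loop" means, the sketch below applies the same arithmetic both ways on a small made-up Series and checks that the results agree.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

shifted = s + 2     # adds 2 to every element at once
squared = s ** 2    # exponentiation is element-wise too

# The equivalent explicit loop, shown only for comparison.
loop_result = pd.Series([x + 2 for x in s], index=s.index)

print(shifted)
print(squared)
```

The index labels are carried through each operation, so shifted and squared still use the labels a, b, c.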

DataFrame

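A DataFrame object has three main attributes:

index: the row index
columns: the column names
values: the data

As before, let's start by simply reading a file: we load a file named ubike.csv into a DataFrame object named ubikes
and use this data to look at what the three attributes contain.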

import pandas as pd
ubikes = pd.read_csv('ubike.csv')
index = ubikes.index
columns = ubikes.columns
value = ubikes.values

ubike file: https://data.gov.tw/dataset/135775
The downloaded file is encoded in Big5; remember to convert it to UTF-8.
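As an alternative to converting the file by hand, pd.read_csv accepts an encoding argument. The sketch below writes a tiny Big5 file to a temporary directory to stand in for the download (the file name and the two rows are assumptions for illustration) and then reads it back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample written in Big5, standing in for the real download.
sample = "民國年,西元年,月份\n109,2020,11\n109,2020,12\n"
path = os.path.join(tempfile.mkdtemp(), "ubike_big5.csv")
with open(path, "w", encoding="big5") as f:
    f.write(sample)

# Tell pandas which encoding the file uses instead of converting it first.
ubikes_sample = pd.read_csv(path, encoding="big5")
print(ubikes_sample.columns.tolist())
```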

index

RangeIndex(start=0, stop=55, step=1)

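This returns a RangeIndex object. It says the index starts at 0 and stops at 55 with a step of 1. As usual in Python, the range is half-open (closed on the left, open on the right), so the final value 55 is not included: the index runs from 0 to 54, 55 rows in total.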

columns

Index(['民國年', '西元年', '月份', '發布機關名稱', '機關代碼', '臺北市YouBike每月使用量（次數）'], dtype='object')

All of the column names are stored in an Index object.

value

[[109 2020 11 '臺北市政府交通局' '379530000H' 2730442]
[109 2020 12 '臺北市政府交通局' '379530000H' 2072168]
[110 2021 1 '臺北市政府交通局' '379530000H' 2291365]

This returns a NumPy ndarray object holding all of the values.
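If you do not have ubike.csv at hand, the three attributes can be inspected on any small DataFrame. This sketch uses a made-up three-row table with English column names (both are assumptions for illustration).

```python
import pandas as pd

# Hypothetical three-row table; the column names are illustrative only.
df = pd.DataFrame({"year": [2020, 2020, 2021], "month": [11, 12, 1]})

print(df.index)     # RangeIndex(start=0, stop=3, step=1)
print(df.columns)
print(df.values)

# stop=3 is excluded, so the row labels are 0, 1, 2.
row_labels = list(df.index)
n_rows = len(df.index)
```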

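Getting Series data

How do we read a column of ubikes? First, take a look at what the CSV file contains.

The columns are:
1 民國年 2 西元年 3 月份 4 發布機關名稱 5 機關代碼 6 臺北市YouBike每月使用量（次數）
Suppose we want to read 西元年 (the Western calendar year). There are two ways:

Put square brackets [] after the DataFrame, with the column name inside
Access the column as an attribute of the DataFrame (this only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)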

print(ubikes["西元年"])
print(ubikes.西元年)

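These column Series support the same Series operations described earlier.

What if I only want the data for specific rows?

Use the loc / iloc selectors, passing a slice or a list to specify the rows and the columns. loc selects by label, and iloc selects by integer position.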
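For instance, a column pulled out of a DataFrame can be used in arithmetic directly. The sketch below rebuilds the two year columns from the three rows shown in the values output above and derives the ROC year from the Western year (ROC year = Western year - 1911); the name ubikes_sample is an assumption so that the real ubikes object is not overwritten.

```python
import pandas as pd

# Year columns only, copied from the three rows printed earlier.
ubikes_sample = pd.DataFrame({"民國年": [109, 109, 110], "西元年": [2020, 2020, 2021]})

year_col = ubikes_sample["西元年"]   # a single column is a Series
roc_year = year_col - 1911          # vectorized subtraction on the whole column

print(type(year_col).__name__)
print(roc_year.tolist())
```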

DataFrame.loc[:, 'ColumnName']
DataFrame.iloc[:, ColumnIndex]

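ubikes.loc[2:5, '西元年']           # rows labeled 2 through 5 (end label included), column 西元年
ubikes.loc[[1, 3, 5, 6], '西元年']  # rows labeled 1, 3, 5, 6

ubikes.iloc[2:5, 2]                # row positions 2, 3, 4 (end excluded); column position 2 is 月份, 西元年 is at position 1
ubikes.iloc[[1, 3, 5, 6], 2]       # row positions 1, 3, 5, 6, column position 2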
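One detail worth seeing in action is that a loc slice includes its end label while an iloc slice excludes its end position. The sketch below uses a made-up seven-row table (column names and values are assumptions) to show the difference.

```python
import pandas as pd

# Hypothetical seven-row table with the default 0..6 index.
df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021, 2021, 2021, 2021],
    "month": [11, 12, 1, 2, 3, 4, 5],
})

by_label = df.loc[2:5, "year"]        # labels 2, 3, 4, 5 (end included)
by_position = df.iloc[2:5, 0]         # positions 2, 3, 4 (end excluded)
picked = df.iloc[[1, 3, 5, 6], 0]     # an explicit list of row positions

print(len(by_label), len(by_position))
print(list(picked.index))
```

To get the same rows from iloc as from df.loc[2:5, "year"], the slice would need to be df.iloc[2:6, 0].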

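Creating a DataFrame

Building a DataFrame by hand is very flexible: the input can be a dict of Series or a dict of ndarrays, or you can pass in a list of dictionaries directly. What if the incoming data is lopsided? Any field that does not line up becomes NaN. Don't worry though: handling missing values in Pandas is also so easy, and I'll cover it in a later post.

create by dictionary of Series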

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

create by dictionary of ndarrays / lists

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
pd.DataFrame(d)

create by list of dictionaries

data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2)
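To see where the NaN values end up when the inputs are lopsided, the sketch below reruns two of the constructions above and checks the missing cells. The names records and missing_in_one are introduced here for illustration.

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

# "one" has no value for label "d", so that cell is NaN.
missing_in_one = df["one"].isna()

# In a list of dictionaries, a key missing from one record also becomes NaN.
records = pd.DataFrame([{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}])

print(df)
print(records)
```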
